overview
We will discuss a number of approaches and methods we may need to use to merge datasets.
We will be using the IFNB-Stimulated and Control PBMCs from Seurat Data. The easiest way to get this data is from their custom package SeuratData which is hosted on GitHub.
We will quickly show you how to get this data, but it isn’t necessary to run these steps.
devtools::install_github("satijalab/seurat-data")library(Seurat)
library(SeuratData)
InstallData("ifnb")
LoadData("ifnb")## An object of class Seurat
## 14053 features across 13999 samples within 1 assay
## Active assay: RNA (14053 features, 0 variable features)
head(ifnb, 2)## orig.ident nCount_RNA nFeature_RNA stim seurat_annotations
## AAACATACATTTCC.1 IMMUNE_CTRL 3017 877 CTRL CD14 Mono
## AAACATACCAGAAA.1 IMMUNE_CTRL 2481 713 CTRL CD14 Mono
This dataset is already loaded in as a Seurat object, and it is already merged. So we need to split it, so we can merge it ourselves! The groups are in the “stim” column of the metadata.
table(ifnb$stim)
ifnb_list <- Seurat::SplitObject(ifnb, split.by = "stim")
ifnb_listWe have the ifnb_list save in an RData object, which you can load in.
load("data/seuOBJ_IFNB_splitByStim.RData")We are going to try out a few merging approaches. We want to wrap some of our analysis steps into a function to simplify rerunning things.
Normalization: * Log normalization with scale factor = 10,000 * Find Variable features with vst, select top 2000 variable features
data_proc <- function(seu) {
seu <- NormalizeData(seu, normalization.method = "LogNormalize", scale.factor = 10000)
seu <- FindVariableFeatures(seu, select.method = "vst", nfeatures = 2000)
return(seu)
}Make clusters: * Scale data with ScaleData() * Principle Component Analysis by using RunPCA() with npcs=30 PCs * Make non-linear dimensional reduction in UMAP by using RunUMAP() with dims=1:10 * Estimate Neighbors by using FindNeighbors() with dims=1:10 * Identify clusters with FindClusters() by using resolution=0.5
quick_clust <- function(seu) {
set.seed(1001)
seu <- ScaleData(seu, verbose = FALSE)
seu <- RunPCA(seu, npcs = 30, verbose = FALSE)
seu <- RunUMAP(seu, reduction = "pca", dims = 1:10, verbose = FALSE)
seu <- FindNeighbors(seu, reduction = "pca", dims = 1:10, verbose = FALSE)
seu <- FindClusters(seu, resolution = 0.5, verbose = FALSE)
return(seu)
}We can use merge() function to integrate our two data sets, as they are. We need to provide Group information and a name for the project as arguments.
ifnb_merge <- merge(ifnb_list$CTRL, ifnb_list$STIM, add.cell.ids = c("CTRL", "STIM"),
project = "ifnb_seuMerge")
head(ifnb_merge, 2)## orig.ident nCount_RNA nFeature_RNA stim
## CTRL_AAACATACATTTCC.1 IMMUNE_CTRL 3017 877 CTRL
## CTRL_AAACATACCAGAAA.1 IMMUNE_CTRL 2481 713 CTRL
## seurat_annotations
## CTRL_AAACATACATTTCC.1 CD14 Mono
## CTRL_AAACATACCAGAAA.1 CD14 Mono
We can use our processing and clustering functions to analyse our merged dataset.
ifnb_merge <- data_proc(ifnb_merge)
ifnb_merge <- quick_clust(ifnb_merge)Our UMAP shows our cells are distinct, depending on the condition.
DimPlot(ifnb_merge, group.by = "stim", pt.size = 0.2)A given cell type should often be clustered together. This pattern indicates the opposite. Different cell types are split into distinct groups depending on the sample.
DimPlot(ifnb_merge, group.by = "seurat_annotations", pt.size = 0.2, split.by = "stim")Reciprocal PCA minimizes the batch effects while merging different data sets.
This workflow is modified from canonical correlation analysis (CCA), which is widely used to merge batches.
First, we prepare the data for integration. We will normalize the data sets separately. Than, we need to identify features for integration. This is similar to the VariableFeatures function we ran on a single dataset. Lastly we run scaling and PCA, using these features.
ifnb_list_rpca <- lapply(ifnb_list, data_proc)
feats <- SelectIntegrationFeatures(ifnb_list_rpca)
ifnb_list <- lapply(ifnb_list, function(seu, feats) {
seu <- ScaleData(seu, features = feats, verbose = FALSE)
seu <- RunPCA(seu, features = feats, verbose = FALSE)
return(seu)
}, feats)We can then identify anchors. These are the features through which we will integrate our data. Once we have these features, we can then integrate our data sets together.
anchors <- FindIntegrationAnchors(ifnb_list, anchor.features = feats, reduction = "rpca")
ifnb_merge <- IntegrateData(anchorset = anchors)
ifnb_merge## An object of class Seurat
## 16053 features across 13999 samples within 2 assays
## Active assay: integrated (2000 features, 2000 variable features)
## 1 other assay present: RNA
To evaluate how well the merge has worked we must check the clustering. Again we must scale, and then use our quick_clust function.
We can now see that our two data sets overlay with each other.
ifnb_merge <- ScaleData(ifnb_merge)
ifnb_merge <- quick_clust(ifnb_merge)
DimPlot(ifnb_merge, group.by = "stim", pt.size = 0.2)We can now see that our two data sets overlay with each other.
We can check the numbers in each cluster. Broadly, there are similar numbers per cluster now.
tbl <- table(ifnb_merge$stim, ifnb_merge$seurat_clusters)
barplot(tbl, beside = T, main = "Cell numbers in each cluster of each group")We can also check the cell types. Using UMAPs we can split and compare across our conditions and cell types.
DimPlot(ifnb_merge, group.by = "seurat_annotations", split.by = "stim", pt.size = 0.2)Using heatmaps we can also check how specific each cluster is to each cell type.
library(pheatmap)
tbl <- table(ifnb_merge$seurat_clusters, ifnb_merge$seurat_annotations)
pheatmap(tbl, scale = "column")Mutual Nearest Neighbors (MNN) approach uses paired cells instead of anchor genes to find the difference between batches.
The MNN correction was published by Marioni et al, Nature (2018). link
First we have to convert Seurat object to SingleCellExperiment object.
sce_list <- lapply(ifnb_list, function(seu) {
sce <- as.SingleCellExperiment(seu, assay = "RNA")
rowData(sce)$SYMBOL <- rownames(sce)
return(sce)
})
sce_list## $CTRL
## class: SingleCellExperiment
## dim: 14053 6548
## metadata(0):
## assays(2): counts logcounts
## rownames(14053): AL627309.1 RP11-206L10.2 ... AP001062.7 LRRC3DN
## rowData names(1): SYMBOL
## colnames(6548): AAACATACATTTCC.1 AAACATACCAGAAA.1 ... TTTGCATGCTTCGC.1
## TTTGCATGGTCCTC.1
## colData names(6): orig.ident nCount_RNA ... seurat_annotations ident
## reducedDimNames(1): PCA
## mainExpName: RNA
## altExpNames(0):
##
## $STIM
## class: SingleCellExperiment
## dim: 14053 7451
## metadata(0):
## assays(2): counts logcounts
## rownames(14053): AL627309.1 RP11-206L10.2 ... AP001062.7 LRRC3DN
## rowData names(1): SYMBOL
## colnames(7451): AAACATACCAAGCT.1 AAACATACCCCTAC.1 ... TTTGCATGCTAAGC.1
## TTTGCATGGGACGA.1
## colData names(6): orig.ident nCount_RNA ... seurat_annotations ident
## reducedDimNames(1): PCA
## mainExpName: RNA
## altExpNames(0):
As with RPCA we need to model and identify highly variable genes. For this we will use the scran functions modelGeneVar() and getTopHVGs(). We simply provide our SingleCellExperiment object.
library(scran)
dec_list <- lapply(sce_list, modelGeneVar)
hvgc_list <- lapply(sce_list, getTopHVGs, prop = 0.1)Next we will find the features that are shared between samples. We must first define the shared “universe” of genes between our samples, and subset our variable genes to these. We can then combine the the variance to find variable features in both data sets.
universe <- intersect(rownames(sce_list$CTRL), rownames(sce_list$STIM))
sce_list <- lapply(sce_list, function(sce, universe) {
sce <- sce[universe, ]
return(sce)
}, universe)
dec_list <- lapply(dec_list, function(dec, universe) {
dec <- dec[universe, ]
return(dec)
}, universe)
combined_dec <- combineVar(dec_list$CTRL, dec_list$STIM)
chosen_hvgs <- combined_dec$bio > 0We will use the batchelor package to run MNN with the fastMNN() function. * d: number of principles evaluated * k: number of nearest neighbors to consider * subset.row: subset genes. Here, we are using the top variable features
library(batchelor)
mnn_res <- fastMNN(CTRL = sce_list$CTRL, STIM = sce_list$STIM, d = 50, k = 20, subset.row = chosen_hvgs)
mnn_res## class: SingleCellExperiment
## dim: 1958 13999
## metadata(2): merge.info pca.info
## assays(1): reconstructed
## rownames(1958): HES4 ISG15 ... CTD-2521M24.5 AJ006998.2
## rowData names(1): rotation
## colnames(13999): AAACATACATTTCC.1 AAACATACCAGAAA.1 ... TTTGCATGCTAAGC.1
## TTTGCATGGGACGA.1
## colData names(1): batch
## reducedDimNames(1): corrected
## mainExpName: NULL
## altExpNames(0):
The resulting SingleCellExperiment object contains the batch information. We can also retrieve a matrix of our dimension reduction results and corrected log counts with the reducedDim() and assay() functions.
table(mnn_res$batch)##
## CTRL STIM
## 6548 7451
reducedDim(mnn_res, "corrected")[1:2, ]## [,1] [,2] [,3] [,4] [,5]
## AAACATACATTTCC.1 0.4692084 0.002558405 0.24112457 -0.01932945 -0.02919753
## AAACATACCAGAAA.1 0.4849751 -0.221433858 -0.03194568 -0.04457300 0.02085705
## [,6] [,7] [,8] [,9] [,10]
## AAACATACATTTCC.1 0.0001116506 -0.05219881 -0.01560542 -0.018456594 -0.02042355
## AAACATACCAGAAA.1 -0.0208653692 0.02518229 -0.05640069 0.001590785 0.02042924
## [,11] [,12] [,13] [,14]
## AAACATACATTTCC.1 -0.008021795 -0.04396845 0.005131835 0.01141103
## AAACATACCAGAAA.1 -0.006484712 0.02690261 -0.014495238 -0.01603657
## [,15] [,16] [,17] [,18]
## AAACATACATTTCC.1 -0.0002610275 -0.001044311 0.02583134 0.004660613
## AAACATACCAGAAA.1 0.0417677345 0.002738321 -0.01539083 -0.028006127
## [,19] [,20] [,21] [,22] [,23]
## AAACATACATTTCC.1 -0.003453839 0.007833436 0.004925768 0.01688819 -0.01693241
## AAACATACCAGAAA.1 0.010241154 0.016709708 -0.019963263 -0.02561860 0.01056931
## [,24] [,25] [,26] [,27]
## AAACATACATTTCC.1 -0.003615097 -0.007882785 -0.01073247 -0.007515978
## AAACATACCAGAAA.1 0.010869553 -0.005000099 0.02215598 -0.000877394
## [,28] [,29] [,30] [,31] [,32]
## AAACATACATTTCC.1 -0.006422341 0.004441876 -0.002957034 -0.01006521 0.002491115
## AAACATACCAGAAA.1 0.003697462 0.014406622 0.003685268 0.01067559 0.021875399
## [,33] [,34] [,35] [,36]
## AAACATACATTTCC.1 -0.001875363 -0.008862148 -0.002032289 0.008468607
## AAACATACCAGAAA.1 0.011806194 0.003560579 0.002324063 -0.002952769
## [,37] [,38] [,39] [,40]
## AAACATACATTTCC.1 -0.007954117 0.0009951654 -0.002792572 0.002435518
## AAACATACCAGAAA.1 -0.006034000 0.0009532853 0.002280508 0.010147309
## [,41] [,42] [,43] [,44]
## AAACATACATTTCC.1 0.015530208 -0.0009633948 0.0026935382 0.0002299324
## AAACATACCAGAAA.1 -0.003641573 0.0034695645 -0.0008694543 0.0037271479
## [,45] [,46] [,47] [,48]
## AAACATACATTTCC.1 -0.003499860 -0.001978646 -0.009881689 0.004610468
## AAACATACCAGAAA.1 -0.002181879 -0.001995283 -0.005362725 -0.001792342
## [,49] [,50]
## AAACATACATTTCC.1 -0.001164966 -0.0095759157
## AAACATACCAGAAA.1 0.005418251 0.0003605282
assay(mnn_res, "reconstructed")[1:2, 1:5]## <2 x 5> LowRankMatrix object of type "double":
## AAACATACATTTCC.1 AAACATACCAGAAA.1 AAACATACCTCGCT.1 AAACATACCTGGTA.1
## HES4 -0.001084752 -0.001208987 -0.001217453 0.001368413
## ISG15 -0.112620920 -0.128283650 -0.111386942 -0.138798526
## AAACATACGATGAA.1
## HES4 -0.001118792
## ISG15 -0.129904486
Make UMAP using the scater package.
library(scater)
set.seed(1001)
mnn_res <- runUMAP(mnn_res, dimred = "corrected")
mnn_res$batch <- factor(mnn_res$batch)
plotUMAP(mnn_res, colour_by = "batch")We will cluster using SNN graph approach from scran as this is comptaible with our SingleCellExperiment object.
snn.gr <- buildSNNGraph(mnn_res, use.dimred = "corrected")
cluster_mnn <- igraph::cluster_louvain(snn.gr)$membership
table(cluster_mnn)## cluster_mnn
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 656 1455 1139 432 964 1052 769 786 936 2043 1299 44 2368 56
Lets check the UMAP for our clustered results.
mnn_res$cluster <- factor(cluster_mnn)
plotUMAP(mnn_res, colour_by = "cluster")We can also check how many cells from each cluster belong to each group. With this approach you can see that there is a lot more group-specific clusters.
tbl <- table(mnn_res$batch, mnn_res$cluster)
barplot(tbl, beside = T, main = "Cell numbers in each cluster of each group")We must annotate our SingleCellExperiment object with different cell type information. We can then visualize the cell types on our UMAP.
cellType <- lapply(sce_list, function(x) {
res <- setNames(as.character(colData(x)$seurat_annotations), colnames(x))
return(res)
})
cell_type <- c(cellType$CTRL, cellType$STIM)
mnn_res$cell_type <- cell_type[match(rownames(colData(mnn_res)), names(cell_type))]
mnn_res$cell_type <- factor(mnn_res$cell_type)plotUMAP(mnn_res, colour_by = "cell_type")We can also use a heatmap to look at the specificity of each cell type to each cluster. Most are cluster-specific.
tbl <- table(mnn_res$cluster, mnn_res$cell_type)
pheatmap(tbl, scale = "column")Performance is dependent on experimental context.
Generally: * RPCA - More homogeneous
* MNN - More heterogeneous
We can prepare for Harmony in much the same way as we prepare for the simple Seurat merge: merge, normalize, scale, PCA and UMAP.
seu_obj <- merge(ifnb_list$CTRL, ifnb_list$STIM)
seu_obj <- data_proc(seu_obj)
seu_obj <- ScaleData(seu_obj)
seu_obj <- RunPCA(seu_obj)
seu_obj <- RunUMAP(seu_obj, reduction = "pca", dims = 1:10, reduction.name = "umap")As you can see we are back with our completely seperate groups.
DimPlot(seu_obj)We can use the RunHarmony() function to implement the Harmony correction.
library(harmony)
seu_obj <- RunHarmony(seu_obj, group.by.vars = "stim", assay.use = "RNA")
seu_obj <- RunUMAP(seu_obj, reduction = "harmony", dims = 1:10, reduction.name = "umap_harmony")
DimPlot(seu_obj, reduction = "umap_harmony")overview
To annotate the Single-cell data sets, we can evaluate the gene expression pattern of well known cell-type specific marker genes and make a manual annotation, like we did in section II. Here, we will introduce two more strategies to make cell type annotations automatically: 1. Mapping and annotating query datasets with Seurat using a reference data set. link 2. Make annotation with SingleR link
The demo data, panc8, was fetched by using SeuratData(). It contains single-cell RNA-Seq of human pancreatic islet sequenced with different techniques.
The demo data contains * CELSeq dataset1 (GSE81076), as reference * CELSeq dataset2 (GSE85241), as reference * SMART-Seq2 (E-MTAB-5061), as reference * Fluidigm C1 (GSE86469), as query
As we did before we will split this dataset. We have also prepped this dataset for you, if you want to load the data object in directly.
library(Seurat)
library(SeuratData)
InstallData("panc8")
data("panc8")Split up the datasets by techniques using SplitObject().
seu_list <- SplitObject(panc8, split.by = "tech")
names(seu_list) <- c("celseq1", "celseq2", "smartseq", "fluidigmc1", "indrop")load("data/panc8.RData")For this we need a reference dataset and a query dataset. Merge celseq1 and celseq2 data with RPCA to make the reference.
seu_list <- lapply(seu_list, data_proc)
ref_list <- seu_list[c("celseq1", "celseq2")]
feats <- SelectIntegrationFeatures(ref_list)
ref_list <- lapply(ref_list, function(seu, feats) {
seu <- ScaleData(seu, features = feats, verbose = FALSE)
seu <- RunPCA(seu, features = feats, verbose = FALSE)
return(seu)
}, feats)Merge celseq1 and celseq2 data with RPCA
anchors <- FindIntegrationAnchors(ref_list, anchor.features = feats, reduction = "rpca")
ref_panc <- IntegrateData(anchorset = anchors)
ref_panc <- ScaleData(ref_panc)
ref_panc <- quick_clust(ref_panc)How does the data look?
DimPlot(ref_panc, group.by = "seurat_clusters", split.by = "tech", pt.size = 0.2,
label = TRUE)panc_list <- list(ref = ref_panc, query = seu_list$smartseq)
feats <- SelectIntegrationFeatures(panc_list)
panc_list <- lapply(panc_list, function(seu, feats) {
seu <- ScaleData(seu, features = feats, verbose = FALSE)
seu <- RunPCA(seu, features = feats, verbose = FALSE)
return(seu)
}, feats)Identify the anchors between reference and query data sets, using FindTransferAnchors(). These are essential to transfer information from our reference to our query. We can then transfer the cell type information. For each cell in query dataset, the score for each given cell type was estimated by the gene expression pattern of anchor genes using the TransferData() function.
anchors <- FindTransferAnchors(reference = panc_list$ref, query = panc_list$query,
dims = 1:30, reference.reduction = "pca")
pred_res <- TransferData(anchorset = anchors, refdata = panc_list$ref$celltype)
head(pred_res, 2)## predicted.id prediction.score.gamma prediction.score.acinar
## AZ_A2 gamma 0.988556270 0
## AZ_B9 alpha 0.004147972 0
## prediction.score.alpha prediction.score.delta prediction.score.beta
## AZ_A2 0.01144373 0.000000000 0.000000
## AZ_B9 0.85757406 0.009551014 0.128727
## prediction.score.ductal prediction.score.endothelial
## AZ_A2 0 0
## AZ_B9 0 0
## prediction.score.activated_stellate prediction.score.schwann
## AZ_A2 0 0
## AZ_B9 0 0
## prediction.score.mast prediction.score.macrophage
## AZ_A2 0 0
## AZ_B9 0 0
## prediction.score.epsilon prediction.score.quiescent_stellate
## AZ_A2 0 0
## AZ_B9 0 0
## prediction.score.max
## AZ_A2 0.9885563
## AZ_B9 0.8575741
The cell type with highest score was assigned to the given cell. We can visualize this score with a heatmap.
mat <- as.matrix(pred_res[, -c(1, 15)])
colnames(mat) <- gsub("prediction.score.", "", colnames(mat))
pheatmap::pheatmap(mat, scale = "row", show_rownames = FALSE)pred_cellType <- setNames(pred_res$predicted.id, rownames(pred_res))
panc_list$query[["cellType_predBySeurat"]] <- pred_cellType[match(Cells(panc_list$query),
names(pred_cellType))]
head(panc_list$query, 2)## orig.ident nCount_RNA nFeature_RNA tech replicate assigned_cluster
## AZ_A2 AZ 384801 3611 smartseq2 smartseq2 <NA>
## AZ_B9 AZ 654549 4433 smartseq2 smartseq2 <NA>
## celltype dataset cellType_predBySeurat
## AZ_A2 gamma smartseq2 gamma
## AZ_B9 alpha smartseq2 alpha
table(panc_list$query$cellType_predBySeurat)##
## acinar activated_stellate alpha beta
## 192 55 1016 320
## delta ductal endothelial gamma
## 125 440 21 204
## macrophage mast quiescent_stellate schwann
## 7 7 5 2
SingleR is bioconductor package which can be used to predict annotations of a single-cell dataset.
This package uses SingleCellExperiment objects, so we need to convert our Seurat object.
sce_list <- lapply(panc_list, function(seu) {
sce <- as.SingleCellExperiment(seu, assay = "RNA")
return(sce)
})The score is generated comparing the expression levels of a cell in query dataset and the expression pattern of certain group (eg. cell types) in reference dataset. A cell would be assigned as the cell type which has highest score.
library(SingleR)
pred_res <- SingleR(ref = sce_list$ref, test = sce_list$query, labels = sce_list$ref$celltype)
head(pred_res, 2)## DataFrame with 2 rows and 4 columns
## scores labels delta.next pruned.labels
## <matrix> <character> <numeric> <character>
## AZ_A2 0.300904:0.178568:0.436055:... gamma 0.1394303 NA
## AZ_B9 0.318227:0.227997:0.529730:... alpha 0.0861217 alpha
By converting to a matrix, we can check the cell type scoring using a heatmap.
mat <- as.matrix(pred_res$scores)
rownames(mat) <- rownames(pred_res)
pheatmap::pheatmap(mat, scale = "row", show_rownames = FALSE)Lastly we want to add our annotation back to our query dataset.
cell_type <- setNames(pred_res$pruned.labels, rownames(pred_res))
panc_list$query$cellType_predBySingleR <- cell_type[match(Cells(panc_list$query),
names(cell_type))]
head(panc_list$query, 2)## orig.ident nCount_RNA nFeature_RNA tech replicate assigned_cluster
## AZ_A2 AZ 384801 3611 smartseq2 smartseq2 <NA>
## AZ_B9 AZ 654549 4433 smartseq2 smartseq2 <NA>
## celltype dataset cellType_predBySeurat cellType_predBySingleR
## AZ_A2 gamma smartseq2 gamma <NA>
## AZ_B9 alpha smartseq2 alpha alpha
First we can compare this back to the original annotation. We will look for how many overlap.
Seurat annotation:
table(panc_list$query$cellType_predBySeurat == panc_list$query$celltype)##
## FALSE TRUE
## 37 2357
SingleR annotation:
table(panc_list$query$cellType_predBySingleR == panc_list$query$celltype)##
## FALSE TRUE
## 35 2299
Next we can compare our two annotations:
table(panc_list$query$cellType_predBySeurat == panc_list$query$cellType_predBySingleR)##
## FALSE TRUE
## 39 2295
tbl <- table(panc_list$query$cellType_predBySeurat, panc_list$query$cellType_predBySingleR)
pheatmap::pheatmap(tbl, scale = "row")This isn’t always the case. This is good demonstration data.
overview
Often empty droplets can still make their way past Cell Ranger, or we may want to do custom filtering outside of Cell Ranger.
Ambient RNAs from lysed cells can contaminate droplets and make empty droplets seem like they contain a cell.
We are revisiting the PBMC 10k data from 10X Genomics as an example to deal with these issues. This time we will use the raw matrix.
download.file("https://cf.10xgenomics.com/samples/cell-exp/6.1.0/10k_PBMC_3p_nextgem_Chromium_Controller/10k_PBMC_3p_nextgem_Chromium_Controller_raw_feature_bc_matrix.tar.gz",
"10k_PBMC_3p_nextgem_Chromium_Controller_raw_feature_bc_matrix.tar.gz")untar("10k_PBMC_3p_nextgem_Chromium_Controller_raw_feature_bc_matrix.tar.gz")We need to load in the dataset, but it is important not to set any filters yet.
library(DropletUtils)
library(Seurat)
raw_mtx <- Seurat::Read10X("raw_feature_bc_matrix")
dim(raw_mtx)## [1] 36601 2099284
Calculate the counts for each cell barcode and their rank.
bcrank <- barcodeRanks(raw_mtx)
head(bcrank, 2)## DataFrame with 2 rows and 3 columns
## rank total fitted
## <numeric> <integer> <numeric>
## AAACCCAAGAAACCAT-1 1752547 0 NA
## AAACCCAAGAAACCCA-1 914872 1 NA
To draw a custom knee plot we need to remove duplicated ranks and grab the knee and inflection points.
uniq <- !duplicated(bcrank$rank)
bcrank <- bcrank[uniq, ]
knee <- metadata(bcrank)$knee
message(paste0("knee point: ", knee))
inflection <- metadata(bcrank)$inflection
message(paste0("inflection point: ", inflection))We can now draw our knee plot using ggplot().
ggplot(as.data.frame(bcrank), aes(x = rank, y = total)) + geom_point() + geom_hline(yintercept = knee,
linetype = "dashed", color = "blue") + geom_hline(yintercept = inflection, linetype = "dashed",
color = "green") + scale_x_continuous(trans = "log10") + scale_y_continuous(trans = "log10") +
labs(x = "cell barcode ranked by counts", y = "UMI counts of each cell barcode") +
theme_classic()We can now draw our knee plot using ggplot(). The knee is labelled in blue. The inflection is labelled in green.
We can detect empty droplets with DropletUtils::emptyDrops(). - The p-value for each cell is performed by Monte Carlo simulation basing on the deviation of a given cell to the ambient RNA pool.
The distribution of UMI counts per cell is often a bimodal distribution. The high-UMI group are likely droplets with a cell. In contrast, the droplets with low UMI are more likely empty droplets containing ambient RNAs.
ggplot(as.data.frame(e.out), aes(x = Total)) + geom_histogram() + geom_vline(xintercept = knee,
color = "red", linetype = "dashed") + labs(x = "UMI counts per cell", y = "Frequency") +
scale_y_continuous(trans = "log10") + scale_x_continuous(trans = "log10") + theme_classic()The distribution of UMI counts per cell is often a bimodal distribution. The high-UMI group are likely droplets with a cell. In contrast, the droplets with low UMI are more likely empty droplets containing ambient RNAs.
In the DropletUtils manual, the FDR threshold is set to 0.001 in most cases. We can filter based on this cutoff.
table(e.out$FDR < 0.001)##
## FALSE TRUE
## 611 11574
cell_sel <- rownames(e.out)[e.out$FDR < 0.001]
filt_mtx <- raw_mtx[, colnames(raw_mtx) %in% cell_sel]The cell-free RNA molecules randomly spread in the solution. It can be enclosed into droplets during the steps of library preparation. So, it would be falsely increase the UMI counts of genes.
Ambient RNAs present two signature
According to these two properties, we could estimate ambient RNA contamination rates and correct it. Here we will demonstrate how to correct ambient RNA contamination with DropletUtils and SoupX, respectively.
Lets start by estimating our ambient RNA contamination. This is from DropletUtils.
amb <- metadata(e.out)$ambient[, 1]
head(amb)## AL627309.1 AL627309.3 AL627309.5 AL627309.4 AL669831.2 LINC01409
## 8.369426e-07 9.859354e-08 5.635719e-06 9.859354e-08 9.859354e-08 6.511074e-06
We can read back in our filtered PBMC data.
download.file("https://cf.10xgenomics.com/samples/cell-exp/6.1.0/10k_PBMC_3p_nextgem_Chromium_Controller/10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.tar.gz",
"10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.tar.gz")
untar("10k_PBMC_3p_nextgem_Chromium_Controller_filtered_feature_bc_matrix.tar.gz")
filt_mtx <- Seurat::Read10X("filtered_feature_bc_matrix/")Then filter out empty droplets.
filt_mtx_drop <- filt_mtx[rownames(filt_mtx) %in% names(amb), ]
seu <- CreateSeuratObject(filt_mtx_drop)seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
sce <- as.SingleCellExperiment(seu, assay = "RNA")plotUMAP(sce, colour_by = "seurat_clusters")Lets put our uncorrected, filtered dataset in a list.
seu_list <- list()
seu_list[["woCorr"]] <- seuWe can use the to removeAmbience() function to remove ambient RNAs.
out <- removeAmbience(counts(sce), ambient = amb, groups = sce$seurat_clusters)
rownames(out) <- rownames(sce)
colnames(out) <- colnames(sce)Now that we have run a correction, we want to reprocess and cluster our data.
seu_list[["withCorr"]] <- CreateSeuratObject(out)
seu_list[["withCorr"]] <- data_proc(seu_list[["withCorr"]])
seu_list[["withCorr"]] <- ScaleData(seu_list[["withCorr"]])
seu_list[["withCorr"]] <- quick_clust(seu_list[["withCorr"]])Lets compare to our marker genes.
mark_gene <- c("CCR7", "CD8A", "MS4A1", "CD14", "FCGR3A", "FCER1A", "GNLY", "NKG7",
"PPBP")
mark_gene## [1] "CCR7" "CD8A" "MS4A1" "CD14" "FCGR3A" "FCER1A" "GNLY" "NKG7"
## [9] "PPBP"
First lets look at the UMAP without ambient RNA correction.
DimPlot(seu_list$woCorr, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
NoLegend()Lets now compare to UMAP wit ambient RNA correction.
DimPlot(seu_list$withCorr, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
NoLegend()Lets now look at a violin plot in the uncorrected data.
VlnPlot(seu_list$woCorr, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)And then again in the corrected.
VlnPlot(seu_list$withCorr, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)SoupX is an alternative tool for droplet processing, developed for ambient RNA detection and correction link. It includes the following steps - Estimate ambient RNA expression in the empty droplets. - Estimate contamination rate by using the ambient RNA expression levels across clusters. - Correct ambient RNA contamination.
We need our filtered and unfiltered matrix. Plus our processed filtered Seurat object.
clust <- setNames(seu$seurat_clusters, Cells(seu))
seu_filt <- seuImport our matrices into the SoupX channel object.
library(SoupX)
soupOBJ <- SoupChannel(raw_mtx, filt_mtx)
soupOBJ <- setClusters(soupOBJ, clust)The autoEstCont() can be used to detect contamination. We can then export the correct counts with adjustCount().
soupOBJ <- autoEstCont(soupOBJ)
autoCor_mtx <- adjustCounts(soupOBJ)load("data/soup.RData")load("data/auto_soup.RData")We will now put our corrected matrix into a Seurat object.
seu <- CreateSeuratObject(autoCor_mtx)
seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
seu_autoCorr <- seuWe can estimate contamination using known marker genes.
mark_list <- list(`CD4 T-cell` = c("IL7R", "CCR7", "S100A4"), `CD8 T-cell` = c("CD8A"),
`B-Cell` = c("MS4A1"), Monocyte = c("CD14", "FCGR3A"), DC = c("FCER1A"), NK = c("NKG7",
"GNLY"), Platelet = c("PPBP"))
use_toEst <- estimateNonExpressingCells(soupOBJ, nonExpressedGeneList = mark_list)
soupOBJ <- calculateContaminationFraction(soupOBJ, mark_list, useToEst = use_toEst)
rho <- unique(soupOBJ$metaData$rho)
rho## [1] 0.0193281
soupOBJ <- setContaminationFraction(soupOBJ, rho, forceAccept = TRUE)
estCor_mtx <- adjustCounts(soupOBJ)seu <- CreateSeuratObject(estCor_mtx)
seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
seu_estCorr <- seumark_gene <- c("CCR7", "CD8A", "MS4A1", "CD14", "FCGR3A", "FCER1A", "GNLY", "NKG7",
"PPBP")
mark_gene## [1] "CCR7" "CD8A" "MS4A1" "CD14" "FCGR3A" "FCER1A" "GNLY" "NKG7"
## [9] "PPBP"
DimPlot(seu_filt, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) + NoLegend()VlnPlot(seu_filt, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)DimPlot(seu_autoCorr, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
NoLegend()VlnPlot(seu_autoCorr, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)DimPlot(seu_estCorr, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
NoLegend()VlnPlot(seu_estCorr, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)CellBender is a python toolkit developed to eliminate technical artifacts from scRNA-Seq data such as empty droplets and ambient RNAs link. It is run from the command line.
In the current version, it contains remove-background module, which is applied to: * Detect empty droplets and ambient RNAs from raw count matrix in CellRanger output * To remove empty droplets and to correct the interference of ambient RNAs
input_h5=the_raw_matrix_in_h5_format_from_cellranger #essential
output_h5=assign_the_h5_file_path_for_the_cellbender_corrected_matrix # essential
expect_cell=expected_cell_number_can_be_find_in_cellranger_Web_Summary # essential
droplet_num=the_total_number_of_droplets_assigned_while_sequencing # default 25,000
fpr=threshols_of_FALSE_POSITIVE_RATE # default 0.01
epochs=number_of_epochs_to_train # default 150
num_train=number_of_times_to_attempt_to_train_the_model # default 1. would speed up while setting greater
#
cellbender remove-background --input $input_h5 --output $output_h5 --expected-cells $expect_cell --total-droplets-included $droplet_num --fpr $fpr --epochs $epochs --num-training-tries $num_train --cuda FalseWe have some processed results here from CellBender. We can compare this to our filtered matrix from Cell Ranger
cbFilt_mtx <- Read10X_h5("data/cbFilt_PBMCv3_20230324_filtered.h5")
dim(cbFilt_mtx)## [1] 36601 12244
dim(filt_mtx)## [1] 36601 11485
message("processing matrix from CellRanger")
seu <- CreateSeuratObject(filt_mtx)
seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
seu_filt <- seu
message("processing matrix from CellBender")
seu <- CreateSeuratObject(cbFilt_mtx)
seu <- data_proc(seu)
seu <- ScaleData(seu)
seu <- quick_clust(seu)
seu_cbFilt <- seumark_gene <- c("CD3E", "CCR7", "CD8A", "MS4A1", "CD14", "FCGR3A", "FCER1A", "GNLY",
"PPBP")
mark_gene## [1] "CD3E" "CCR7" "CD8A" "MS4A1" "CD14" "FCGR3A" "FCER1A" "GNLY"
## [9] "PPBP"
DimPlot(seu_filt, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) + NoLegend()DimPlot(seu_cbFilt, group.by = "seurat_clusters", pt.size = 0.1, label = TRUE) +
NoLegend()VlnPlot(seu_filt, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)VlnPlot(seu_cbFilt, features = mark_gene, group.by = "seurat_clusters", pt.size = 0)overview
sce <- readRDS("data/sceOBJ_Nestorowa_HSC.rds")
sce## class: SingleCellExperiment
## dim: 46078 1656
## metadata(0):
## assays(2): counts logcounts
## rownames(46078): ENSMUSG00000000001 ENSMUSG00000000003 ...
## ENSMUSG00000107391 ENSMUSG00000107392
## rowData names(3): GENEID SYMBOL SEQNAME
## colnames(1656): HSPC_025 HSPC_031 ... Prog_852 Prog_810
## colData names(5): cell.type FACS sizeFactor label cellType
## reducedDimNames(3): diffusion PCA UMAP
## mainExpName: endogenous
## altExpNames(1): ERCC
To find the trajectories we need to fit our high dimensional data onto a curve. This step is processed with the slingshot link package.
To run slingshot we need: * Our sce object: sce * Clusters: sce$label * Our choice of dimension reduction approach: PCA * To reduce computational load the adjacent 100 cells are aggregated into single point (approx_points=100) * To avoid connecting unrelated trajectories, OMEGA clustering is introduced (omega=TRUE).
library(slingshot)
sce.sling <- slingshot(sce, cluster = sce$label, reducedDim = "PCA", approx_points = 100,
omega = TRUE)colData(sce.sling)[1, 1:5]## DataFrame with 1 row and 5 columns
## cell.type FACS sizeFactor label
## <lgCMatrix> <matrix> <numeric> <factor>
## HSPC_025 FALSE:FALSE:TRUE:... 27675.2:1.17991:1.12999:... 0.499986 3
## cellType
## <character>
## HSPC_025 Erythrocytes
Slingshot has fitted principle curves. We now will embed these cruves in UMAP space embedCurves(). We can see the number o lineages detected at this point.
embedded <- embedCurves(sce.sling, "UMAP")
embedded@metadata$lineages## $Lineage1
## [1] "8" "5" "9" "1" "2" "3" "4"
##
## $Lineage2
## [1] "8" "5" "9" "1" "2" "6"
##
## $Lineage3
## [1] "7"
Once we have lineages, we can then find the position of each cell along these trajectories. This is estimating the pseudotime.
pseudo.paths <- slingPseudotime(sce.sling)
head(pseudo.paths, 2)## Lineage1 Lineage2 Lineage3
## HSPC_025 111.82993 NA NA
## HSPC_031 96.15939 99.77961 NA
It will be useful to combine these times, to find the shared time.
avg_pseudoTime <- rowMeans(pseudo.paths, na.rm = TRUE)
colData(sce.sling)$avg_pseudoTime <- avg_pseudoTimeOnce we have the average pseudotime we can always project these values on to our UMAP.
plotUMAP(sce.sling, colour_by = "avg_pseudoTime")The slingCurves() function can fetch essential information of each principle curve. The results are inside a list, with each principle curve in an element. For each principle curve, we can extract the UMAP data and the order of cells for plotting.
embedded_curve <- slingCurves(embedded)
curve <- lapply(embedded_curve, function(x) {
dat <- data.frame(x$s[x$ord, ]) # UMAP axis and the order of cells
return(dat)
})
names(curve) <- c("curve1", "curve2", "curve3")
head(curve$curve1)## UMAP1 UMAP2
## 1 -12.23250 0.6437628
## 2 -12.01319 0.7200463
## 3 -11.79388 0.7963086
## 4 -11.57455 0.8725464
## 5 -11.35521 0.9487473
## 6 -11.13585 1.0248888
We can now just add our curve information directly to our pseudotime UMAP plot using ggplot parameters geom_path().
plotUMAP(sce.sling, colour_by = "avg_pseudoTime") + geom_path(data = curve$curve1,
aes(x = UMAP1, y = UMAP2), color = "blue") + geom_path(data = curve$curve2, aes(x = UMAP1,
y = UMAP2), color = "black") + geom_path(data = curve$curve3, aes(x = UMAP1,
y = UMAP2), color = "red")The results indicated that lineage 3 is an outlier. If we look back at our lineage information this contains cluster 7.
embedded@metadata$lineages## $Lineage1
## [1] "8" "5" "9" "1" "2" "3" "4"
##
## $Lineage2
## [1] "8" "5" "9" "1" "2" "6"
##
## $Lineage3
## [1] "7"
We can remove cluster 7 and re-evaluate trajectories again. First we rerun the pseudotime estimation.
sce2 <- sce[, colData(sce)$label != 7]
sce.sling2 <- slingshot(sce2, cluster = sce2$label, reducedDim = "PCA", approx_points = 100,
omega = TRUE)
pseudo.paths <- slingPseudotime(sce.sling2)
avg_pseudoTime <- rowMeans(pseudo.paths, na.rm = TRUE)
colData(sce.sling2)$avg_pseudoTime <- avg_pseudoTimeNext we extract out the princple curves again.
embedded <- embedCurves(sce.sling2, "UMAP")
embedded_curve <- slingCurves(embedded)
curve <- lapply(embedded_curve, function(x) {
dat <- data.frame(x$s[x$ord, ]) # UMAP axis and the order of cells
return(dat)
})
names(curve) <- c("curve1", "curve2")plotUMAP(sce.sling2, colour_by = "avg_pseudoTime") + geom_path(data = curve$curve1,
aes(x = UMAP1, y = UMAP2), color = "blue") + geom_path(data = curve$curve2, aes(x = UMAP1,
y = UMAP2), color = "black")So far we have visualized the lineages together. We will want to break this down and compare to clusters. First we will look at clusters.
plotUMAP(sce.sling2, colour_by = "label")We can reveal which cluster belong to each cluster and lineage 1’s trajectory.
embedded@metadata$lineages$Lineage1## [1] "8" "5" "9" "1" "2" "3" "4"
plotUMAP(sce.sling2, colour_by = "slingPseudotime_1") + geom_path(data = curve$curve1,
aes(x = UMAP1, y = UMAP2))We can reveal which cluster belong to each cluster and lineage 2’s trajectory.
embedded@metadata$lineages$Lineage2## [1] "8" "5" "9" "1" "2" "6"
plotUMAP(sce.sling2, colour_by = "slingPseudotime_2") + geom_path(data = curve$curve2,
aes(x = UMAP1, y = UMAP2))Driver genes were the genes whose expression level changes reflect the pseudotime changes. We can find these using testPseudotime() from TSCAN. Using this the gene expression and pseudotime are fitted with a non-linear model, followed by testing with ANOVA.
We will start by just looking at lineage 1.
library(TSCAN)
res <- testPseudotime(sce.sling2, sce.sling2$slingPseudotime_1)We can see the results. Often we will want to reorder it too prioritize the biggest changers.
res$SYMBOL <- rownames(res)
res <- res[order(-res$logFC), ]
head(res)## DataFrame with 6 rows and 4 columns
## logFC p.value FDR SYMBOL
## <numeric> <numeric> <numeric> <character>
## ENSMUSG00000021728 0.0678996 0.00000e+00 0.00000e+00 ENSMUSG00000021728
## ENSMUSG00000016494 0.0654681 0.00000e+00 0.00000e+00 ENSMUSG00000016494
## ENSMUSG00000040747 0.0639249 7.23447e-296 5.16939e-293 ENSMUSG00000040747
## ENSMUSG00000057729 0.0619007 0.00000e+00 0.00000e+00 ENSMUSG00000057729
## ENSMUSG00000021998 0.0610830 0.00000e+00 0.00000e+00 ENSMUSG00000021998
## ENSMUSG00000090164 0.0608169 4.81695e-237 1.76510e-234 ENSMUSG00000090164
We can then categorize our data based on significance and FC.
res$stat <- NA
res$stat[res$logFC > 0 & res$FDR < 0.05] <- "late"
res$stat[res$logFC < 0 & res$FDR < 0.05] <- "early"
res$stat[is.na(res$stat)] <- "nc"
res[1:2, ]## DataFrame with 2 rows and 5 columns
## logFC p.value FDR SYMBOL stat
## <numeric> <numeric> <numeric> <character> <character>
## ENSMUSG00000021728 0.0678996 0 0 ENSMUSG00000021728 late
## ENSMUSG00000016494 0.0654681 0 0 ENSMUSG00000016494 late
We can quickly and easily see how many are upregulated/downregulated across the trajectory.
table(res$stat)##
## early late nc
## 11261 8848 25969
Using the order we can easily isolate the driver genes. Here we can see the top 5 early genes.
sub <- res[res$stat == "early", ]
sub <- sub[order(sub$logFC), ]
top <- head(sub$SYMBOL, 5)
top## [1] "ENSMUSG00000054191" "ENSMUSG00000004655" "ENSMUSG00000027556"
## [4] "ENSMUSG00000031877" "ENSMUSG00000031162"
With plotExpression() we can look at the expression of specific genes i.e. our top genes. We can see how each gene is expressed across all cells and psuedotime.
plotExpression(sce.sling2, features = top, swap_rownames = "GENEID", x = "slingPseudotime_1",
colour_by = "label")We can now repeat this for the top 5 late genes.
sub <- res[res$stat == "late", ]
sub <- sub[order(-sub$logFC), ]
top <- head(sub$SYMBOL, 5)
plotExpression(sce.sling2, features = top, swap_rownames = "GENEID", x = "slingPseudotime_1",
colour_by = "label")We can now repeat this for the top 5 late genes.
Now lets repeat this whole process for lineage 2.
res <- testPseudotime(sce.sling2, sce.sling2$slingPseudotime_2)
res$SYMBOL <- rownames(res)
res <- res[order(-res$logFC), ]
res$stat <- NA
res$stat[res$logFC > 0 & res$FDR < 0.05] <- "late"
res$stat[res$logFC < 0 & res$FDR < 0.05] <- "early"
res$stat[is.na(res$stat)] <- "nc"
table(res$stat)##
## early late nc
## 11566 8450 26062
sub <- res[res$stat == "early", ]
sub <- sub[order(sub$logFC), ]
top <- head(sub$SYMBOL, 5)
plotExpression(sce.sling2, features = top, swap_rownames = "GENEID", x = "slingPseudotime_1",
colour_by = "label")sub <- res[res$stat == "late", ]
sub <- sub[order(-sub$logFC), ]
top <- head(sub$SYMBOL, 5)
plotExpression(sce.sling2, features = top, swap_rownames = "GENEID", x = "slingPseudotime_1",
colour_by = "label")The CITE-Seq method labels different types of cells or different samples with hashtag-labeled antibodies. Then, the cells can be pooled together into a single library and be sequencing at the same time. After sequencing, the cells can be separated by the different antibody hashtags.
rna.mat <- readRDS("data/pbmc_umi_mtx.rds")
dim(rna.mat)## [1] 27117 16916
hto.mat <- readRDS("data/pbmc_hto_mtx.rds")
dim(hto.mat)## [1] 8 16916
rownames(hto.mat)## [1] "HTO_A" "HTO_B" "HTO_C" "HTO_D" "HTO_E" "HTO_F" "HTO_G" "HTO_H"
seu_obj <- CreateSeuratObject(counts = rna.mat, project = "citeSeq_demo")
seu_obj[["HTO"]] <- CreateAssayObject(counts = hto.mat)
seu_obj## An object of class Seurat
## 27125 features across 16916 samples within 2 assays
## Active assay: RNA (27117 features, 0 variable features)
## 1 other assay present: HTO
DefaultAssay(seu_obj) <- "RNA"
seu_obj <- data_proc(seu_obj)
seu_obj <- ScaleData(seu_obj)seu_obj <- quick_clust(seu_obj)
DimPlot(seu_obj, group.by = "seurat_clusters", pt.size = 0.2, label = TRUE) + NoLegend()RidgePlot(seu_obj, group.by = "hash.ID", assay = "HTO", features = rownames(seu_obj[["HTO"]])[1:2],
ncol = 2)RidgePlot(seu_obj, group.by = "hash.ID", assay = "HTO", features = rownames(seu_obj[["HTO"]])[3:4],
ncol = 2)